Abstract


Deep learning methods are often used for image classification or local object 
segmentation. The corresponding test and validation data sets are an integral part 
of the learning process and also of the algorithm performance evaluation. High 
and particularly very high-resolution Earth observation (EO) applications based 
on satellite images primarily aim at the semantic labeling of land cover structures 
or objects as well as of temporal evolution classes. However, one of the main EO 
objectives is physical parameter retrievals such as temperatures, precipitation, and 
crop yield predictions. Therefore, we need reliably labeled data sets and tools to 
train the developed algorithms and to assess the performance of our deep learning 
paradigms. Generally, imaging sensors generate a visually understandable representation
of the observed scene. However, this does not hold for many EO images,
where the recorded images only depict a spectral subset of the scattered light field,
thus generating an indirect signature of the imaged object. This highlights the burden of
EO image understanding as a new and particular challenge for Machine Learning
(ML) and Artificial Intelligence (AI). This chapter reviews and analyzes the new
approaches of EO imaging leveraging the recent advances in physical process-based
ML and AI methods and signal processing.


Keywords: Earth observation, synthetic aperture radar, multispectral, 
machine learning, deep learning 


1. Introduction 


This chapter introduces the basic properties, features, and models for very 
specific Earth observation (EO) cases recorded by very high-resolution (VHR) 
multispectral, Synthetic Aperture Radar (SAR), and multi-temporal observations. 
Further, we describe and discuss procedures and machine learning-based tools to 
generate large semantic training and benchmarking data sets. The particularities 
of relative data set biases and cross-data set generalization are reviewed, and an 
algorithmic analysis frame is introduced. Finally, we review and analyze several 
examples of EO benchmarking data sets.

In the following, we describe what has to be taken into account when we want to
benchmark the classification results of satellite images, in particular the classification
capabilities, throughputs, and accuracies offered by modern machine learning
and artificial intelligence approaches.

Our underlying goal is the identification and understanding of the semantic 
content of satellite images and their application-oriented interpretation from 
a user perspective. In order to determine the actual performance of automated 
image classification routines, we need to find and select test data and to analyze 
the performance of our classification and interpretation routines in an automated 
environment. 

A particular point to be understood is what type of data exists for remote 
sensing images that we want to classify. We are faced with long processing chains 
for the scientific analysis of image data, starting with uncalibrated “raw” sensor 
data, followed by dedicated calibration steps, subsequent feature extraction, object 
identification and annotation, and ending with quantitative scientific research 
and findings about the processes and effects being monitored in the geophysical 
environment of our planet with respect to climate change, disaster risks, crop yield 
predictions, etc. 

In addition, we have to mention that free and open-access satellite products 
have revolutionized the role of remote sensing in Earth system studies. In our case, 
the data being used are based on multispectral (i.e., multi-color) sensors such as 
Landsat with 7 bands, Sentinel-2 [4] with 13 bands, Sentinel-3 with 21 bands, and 
MODIS with 36 bands but also SAR sensors such as Sentinel-1 [6], TerraSAR-X 
[26] or RADARSAT. For a better understanding of their imaging potential, we will 
describe the most important parameters of these images. For multispectral sensors,
there exist several well-known and publicly available land cover benchmarking
data sets comprising typical remote sensing image patches, while comparable SAR
benchmarking data sets are very scarce and dedicated.

The main aspects being treated are: 


• ML paradigms to support the semantic annotation of very large data sets,
that is, using hybrid methods integrating Support Vector Machines (SVMs),
Bayesian, and Deep Neural Network (DNN) algorithms in active learning
paradigms by using initially small and controllable training data sets, and
progressively growing the volume of labeled data by transfer learning.


• Proposing solutions to the semantic aspects of the spatial annotations for
different sensor resolutions and spatial scales.


• Discussing the implications of the sensory and semantic gaps.


In this chapter, we assume that we can rely on already processed data with suf- 
ficient calibration accuracy and accurate annotation allowing us to understand all 
imaging parameters and their accuracy. We also assume that we can profit from reli- 
ably documented image data and that we can continue with data analytics for image 
understanding and high-level interpretation without any further precautions. 

The latter steps have to be organized systematically in order to guarantee reliable
results. A common strategy is to split these tasks into three phases: first, initial
testing of the basic software functionality; second, training and optimizing the
software parameters by means of selected reference data; and finally, benchmarking
the overall software functionality, such as processing speed and attainable results.
This systematic approach leads to quantifiable and comparable results as described
in the following sections.

In recent years, the field of deep learning has expanded explosively in
many domains, predominantly in computer vision, speech recognition, and
text analysis. For example, during 2019, more than 500 articles per month were
published in the field of deep learning. Thus, reports on the state of the
art can hardly keep up with this development. In Ref. [1], published in January 2019,
more than 330 references were analyzed, reviewing the theoretical and architectural
aspects of Convolutional Neural Networks (CNNs), Recurrent Neural Networks
(RNNs), including Long Short-Term Memories (LSTMs) and Gated Recurrent
Units (GRUs), Auto-Encoders (AEs), Deep Belief Networks (DBNs), Generative
Adversarial Networks (GANs), and Deep Reinforcement Learning (DRL). The
review paper [1] also summarizes 20 deep learning frameworks, two standard
development kits, and 49 benchmark data sets in all domains, of which three are
dedicated to hyperspectral remote sensing. In addition, Ball et al. [2] describe the
landscape of deep learning from all perspectives (theory, tools, applications, and
challenges) as of 2017; their article analyzes 419 references. A more recent overview
from April 2019 [3] summarizes more than 170 references reporting on applications
of deep learning in remote sensing.


2. Remote sensing images 


Typical remote sensing images acquired by aircraft or satellite platforms can 
be characterized based on the operational capabilities of these platforms (such as 
their flight path, their capabilities for instrument pointing, and the on-board data 
storage and data downlink capacities), the type of instruments and their sensors 
(such as optical images with distinctive spectral bands [4, 5] or radar images such 
as synthetic aperture radars [6]), and opportunities for the repetitive acquisition of 
geographically overlapping image time series (for instance, for vegetation monitor- 
ing to predict optimal crop harvesting dates). 

Current imaging instruments can provide raw data with more than eight bits per
sample, can perform initial data processing and annotation already on board, and can
downlink compressed data with error-correcting codes. After downlinking the image
data to ground stations, the received data will be stored and processed by dedicated
computing facilities. A common remote sensing strategy is to perform a systematic
level-by-level processing (generating so-called products that comprise image data
together with metadata documenting relevant image acquisition and processing
parameters).

A common conventional approach is to follow a unified concept, where Level-0 
products contain unprocessed but re-ordered detector data; Level-1 data represent 
radiometrically calibrated intensity images, while Level-2 data are geometrically 
corrected and map-projected data. Level-3 data are higher level products such as 
semantic maps or overlapping time-series data. In general, users have access to
different product levels and can access and download selected products from databases
via image catalogs and so-called quick-look (also called thumbnail) images.

Some additional products have to be generated interactively by the users. Typical
examples are image content classifications and trend analyses following mathematical
approaches. Today, these interactive steps are migrating from purely interactive and
simple tools to commonly accepted machine learning tools. At the moment, the
majority of machine learning tools use “deep” learning approaches; here, the problem
is decomposed into several layers to find a good representation of image content
categories [7]. These aspects will be dealt with in more detail in Section 4.

What we have to outline first are some important parameters of remote sensing
images. One critical point of typical remote sensing images is their enormous size,
calling for big data environments with powerful processors and large data stores.
A second important point is the geometrical and radiometrical resolution of the
image pixels, which determines the target types that can be identified and
discriminated during classification. While the typical pixel-to-pixel spacing of air-borne
cameras corresponds to centimeters on the ground, the pixel spacing of high-resolution
space-borne instruments flown on low polar orbits mostly lies in the range of half a
meter to a few meters. In contrast, imaging from more distant geostationary or
geosynchronous orbits results in low-resolution images. As for the number of brightness
levels of each pixel, modern cameras often provide more than eight bits of resolution.
Table 1 shows some typical parameters of current satellites with imaging
instruments.

Further, the pixels of an image can be complemented by additional informa- 
tion obtained by feature extraction and automated object identification (used as 
image content descriptors) as well as publicly available information from auxiliary 
external databases (e.g., geographical references or geophysical parameters). These 
data allow the provision of accurate quantitative results in physical units; however, 
one has to be aware of the fact that while many phenomena become visible, some 
internal relationships may remain invisible without dedicated additional investiga- 
tions. Table 1 shows some typical parameters of current satellite images. 
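
As a rough illustration of the image sizes mentioned above, the following sketch estimates the uncompressed data volume of a single scene. The scene dimensions (10^4 x 10^4 pixels), the sample depths, and the band counts are illustrative assumptions and are not taken from a specific product; the function name is hypothetical.

```python
def scene_size_bytes(lines, columns, bands, bits_per_sample):
    """Uncompressed size of one scene in bytes (no metadata, no compression)."""
    # Assumption: each sample is stored in a whole number of bytes.
    bytes_per_sample = (bits_per_sample + 7) // 8
    return lines * columns * bands * bytes_per_sample


# Hypothetical multispectral scene: 10^4 x 10^4 pixels, 13 bands, 12-bit samples.
optical = scene_size_bytes(10_000, 10_000, bands=13, bits_per_sample=12)

# Hypothetical complex-valued SAR scene: the real and imaginary parts are
# counted as two 16-bit "bands".
sar = scene_size_bytes(10_000, 10_000, bands=2, bits_per_sample=16)

print(f"optical: {optical / 1e9:.1f} GB, SAR: {sar / 1e9:.1f} GB")
```

Even under these modest assumptions, a single scene occupies gigabytes, so image time series and archives quickly reach sizes that call for the big data environments mentioned above.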

In addition to the standard image products as described above, any additional
automated or interactive analysis and interpretation of remote sensing images calls
for intelligent strategies for how to quickly select distinct and representative images,
to generate image time series, to extract features, to identify objects, to recognize
hitherto hidden relationships and correlations, to exploit statistical descriptive
models describing additional relationships, and to apply techniques for the annotation
and visualization of global/local image properties (that have to be stored and
administered in databases).

Typical traditional image content analysis tools use either full images,
sequences of small image patches, collections of mid-size image segments, or countless
individual pixels, together with either routines from already established toolboxes
(e.g., Orfeo [9]) or advanced machine learning approaches exploiting innovative
machine learning strategies, as for instance, transfer learning [8] or the use of
adversarial networks [10]. However, any use of advanced approaches requires the
preparation and conduction of tests that allow a benchmarking of the new software
routines, notably methods and tools to generate and analyze data for testing,
training, verification, and final benchmarking. These testing activities have to be
supported by efficient visualization tools.


High-resolution imaging    Optical cameras and             SAR instruments
instruments                spectrometers

Image size (typ. lines x   10^4 x 10^4 pixels              10^4 x 10^4 pixels
columns)

Bands (typ.)               300 to 1000 nm and infrared     C-band, X-band, L-band, etc.
                           bands

Spatial resolution (typ.)  Sub-meter to tens of m          Meters to several meters

Target areas (typ.)        Land, ocean, ice, atmosphere    Land, ocean, ice

Special modes (typ.)       Dynamical targeting, stereo     V and H polarization, scan modes,
                           views, fusion of bands          interferometry

Pixel types (typ.)         Detector counts, reflectances   Complex-valued data, “detected data”
                                                           (amplitudes or intensities)

Important parameters       Number of overlapping bands     Viewing/incidence angle, polar or
(typ.)                                                     geo. orbit

Table 1.
Typical imaging parameters of current satellites.


Clipping of outliers and de-noising
Color coding of brightness levels
Histogram manipulation (e.g., stretching)
Normalization and contrast enhancement
Box-car filtering (e.g., high-pass filtering, smoothing)
Transformations and filtering of coefficients
Analysis of pixel statistics and use of computer vision algorithms (e.g., histograms of gradients, local binary patterns, speeded-up robust features)
Feature extraction and classification (edges, corners, ridges, texture, color, interest points, shapes)
Extraction of content-oriented regions and objects

Table 2.
Typical capabilities of traditional image content analysis tools.

As can be seen from Table 2, quite a number of traditional image content
analysis tools already exist. Some of them generate pre-processed images for
subsequent analysis by human image interpreters, while others allow the identification
and extraction of objects. However, these tools do not yet exploit the most recent
automated machine learning techniques.


3. Machine learning, artificial intelligence, and data science for remote 
sensing 


Currently, we see a lot of public interest in machine learning (ML), artificial
intelligence (AI), and data science (DS). We have to be clear about what we mean by
these buzzwords:


• ML is often used when we describe technical developments where a computer
system is trained and used to find and classify objects in data sets. A prominent
example is the identification and interpretation of traffic signs for automated
driving. These are typical use cases where a computer system is coupled with a
camera and other sensors, and the traffic signs have to be recognized independently
of different illumination and weather conditions, a vast range of potential driving
speeds, varying distances and perspectives, other cars moving within the field
of view of the camera, supplementary information provided by text panels or
adjacent traffic signs, and constraints to be observed such as the maximum
reasonable processing time. In essence, we can consider these applications as a
reduction of many image pixels into single features (from a given list of cases
and options) or a combination of features (e.g., max speed of 30 mph except
on weekends). In most cases, the ML software is trained and tested with many
typical examples as well as counterexamples.


• AI combines the full functionality of ML with additional decision-making and
reaction capabilities. This additional decision-making can be implemented
by continuous understanding of the current overall situation, the extraction
of reactions from given rule sets (supported by continuously updated
parameters), and the handling of unexpected emergency cases. In the case
of autonomous driving, one can think of a lane change on a motorway: first,
a reason for a lane change has to be found, and among a number of alternative
reactions, a lane change has to appear as the best one. Then the current
situation has to be checked to determine when a lane change becomes possible,
and a sequence of subsequent actions is executed.


• DS as a scientific and technical discipline of its own shall provide all guiding
principles that are needed from end-to-end system design up to data analytics
and image understanding, including the system layout and verification, the
selection of components and tools, the implementation and installation of the
components and their verification, and the benchmarking of the full
functionality. In the case of remote sensing applications, we also have to include all
aspects of sensor calibration, comparisons with the findings of other researchers
via the Internet, and traceable scientific data interpretation.


As our applications are mostly use cases dealing with remote sensing images, we can
limit ourselves to the main ML paradigms that support the semantic annotation of
very large data sets. Based on the current state-of-the-art developments, we consider
that there are three currently important fundamental and internationally accepted
image classification approaches for remote sensing applications and two additional
learning principles useful for satellite images:


• Bayesian networks: a Bayesian network consists of a probabilistic graphical
model representing a set of variables together with their conditional
dependencies. It can be used for parameter learning and is based on traditional
formulas derived by Bayes [11].


• Support Vector Machines (SVMs): SVMs support classification and regression
tasks by identifying support points (support vectors) that are used to define a
robust separation plane between the sample points of different classes. In
general, the resulting separation boundary is nonlinear in the original data
space. In order to obtain a separation plane with linear characteristics, the
sample points are mapped into a higher-dimensional space in which a linear
hyperplane can separate them. This mapping exploits so-called kernel
functions [12]. A well-known SVM software package is [13], which also explains
how to train and verify a new SVM.


• Neural Networks: neural networks follow the concept of biological neurons
that trigger a positive response if the input signal corresponds to a known
object. Thus, technical implementations mostly consist of three levels, namely
a visible input layer followed by an internal processing layer that is not visible
to the user (in principle, an artificial neural network), and a visible output
layer. Deep neural networks are an extension of general neural networks;
here, the processing layer is split into several linked internal sublayers that
allow a more detailed analysis of the input data (e.g., on selected scales).
The internal network parameters (i.e., the filter coefficients) are derived
(“learned”) by means of typical (and atypical) image samples and manual
labeling by users [11].


• Active learning: this learning strategy combines automated learning with
interactive steps involving the user during important decisions. This can be
accomplished by a visualization interface where a user can select or deselect
image patches that do belong to or do not belong to a specific target class. For
further details, see [14].


• Transfer learning: the idea of transfer learning is to train a network for a given
task and then to exploit or “translate” the resulting network parameters to
another use case. A typical example cited in [8] is that knowledge gained
while learning to recognize cars in images is applied when trying to recognize
trucks.
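
The mapping into a higher-dimensional space mentioned in the SVM item above can be illustrated with a small sketch. It uses an explicit feature map rather than a kernel function (a kernel would compute inner products in such a space implicitly), and the toy data, the map, and the plane parameters are all assumptions chosen for illustration, not part of any SVM package.

```python
import math
import random


def feature_map(x, y):
    """Explicit lift from 2-D to 3-D: append the squared radius as third coordinate."""
    return (x, y, x * x + y * y)


def ring_sample(rng, r_min, r_max):
    """Draw one 2-D point with radius in [r_min, r_max] and a random angle."""
    r = rng.uniform(r_min, r_max)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    return (r * math.cos(phi), r * math.sin(phi))


rng = random.Random(0)
inner = [ring_sample(rng, 0.0, 1.0) for _ in range(100)]  # class A: unit disc
outer = [ring_sample(rng, 2.0, 3.0) for _ in range(100)]  # class B: surrounding ring

# No straight line separates the disc from the ring in 2-D, but in the lifted
# space the plane z = 2.25 does (weights w and offset b chosen by hand here).
w, b = (0.0, 0.0, 1.0), -2.25


def decide(point):
    """Signed value of the linear decision function in the lifted space."""
    z = feature_map(*point)
    return sum(wi * zi for wi, zi in zip(w, z)) + b


print(all(decide(p) < 0 for p in inner), all(decide(p) > 0 for p in outer))
```

A trained SVM would determine the plane from the support vectors instead of fixing it by hand; the sketch only shows why a linear plane can suffice after the mapping.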


One of the most critical points for satellite image classification is the dependence
of the classification results on the resolution (pixel spacing) of the images.
Experiences gained by many authors demonstrate that the identified classes and
their local assignment within image patches are strongly resolution-dependent, as
higher resolution will often lead to a higher number of visible and identified semantic
categories. Thus, the performance of any semantic interpretation of images must
be considered as a data-dependent metric: this potential difficulty should prevent
us from blind direct performance comparisons.

Another similar point to be mentioned is the risk of sensory and semantic gaps 
encountered during image classification. Sensory gaps result from cases where a 
sensing instrument cannot measure the full range of potential cases with all their 
physical effects and details that could exist in a real-world scene and that we cannot 
record and identify with uniform confidence. A similar potential pitfall for image 
understanding can result from semantic gaps. For instance, during interactive 
labeling by test persons, different people could assign different categories to image 
patches due to their educational background, professional experiences, etc. For 
further details, see [15]. 
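
One common way to quantify how strongly two annotators disagree is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The following minimal sketch is offered as an illustration only; the chapter does not prescribe this statistic, and the label lists are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators of the same patches."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both annotators must label the same non-empty patch list")
    # Fraction of patches on which both annotators chose the same category.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both annotators labeled independently at random
    # with their own category frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one single, identical category
    return (observed - expected) / (1.0 - expected)


# Invented labels assigned by two annotators to the same six image patches.
annotator_1 = ["urban", "urban", "forest", "water", "forest", "urban"]
annotator_2 = ["urban", "forest", "forest", "water", "forest", "industrial"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Values near 1 indicate consistent labeling, while values near 0 indicate agreement no better than chance, which would point to a pronounced semantic gap between the annotators.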

The number of available approaches, algorithms, and tools is growing continu- 
ously. Some examples have become very widespread in academia such as Caffe 
[16], TensorFlow [17], and PyTorch [18]. In contrast to these established solutions, 
a large number of fresh publications are submitted every day. As an example, the 
ArXiv preprint repository [19] collects in its “computer science” and “statistics” 
directories hundreds of new machine learning papers per day. 


4. Networks for deep learning 


Many experiments with image classification systems have shown that traditional
single-level (“shallow”) algorithms perform worse than multi-level (“deep”)
concepts where distinct filtering operations are applied on each level, and the
results of the previous levels can be used on each deeper level; the final result will be
obtained by combining the specific results of each separate level. The reason for the
better performance of multi-level algorithms is that one can apply distinct filters
specifically tailored to each level. Typical examples are multi-resolution filters that
detect image characteristics on several scales: when we look at satellite images of
urban settlements, a business district normally has larger high-rise buildings
and broader streets than a residential suburb with interspersed low-rise buildings
and individual gardens.
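
The idea of analyzing an image on several scales can be sketched with a simple image pyramid built by block averaging. This is a deliberately minimal stand-in for the multi-resolution filters mentioned above, not a learned filter bank; the function names and the toy image are assumptions.

```python
def block_average(image, factor):
    """Downsample a 2-D list by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    return [
        [
            sum(image[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / (factor * factor)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def pyramid(image, levels):
    """Return [full resolution, 1/2 resolution, 1/4 resolution, ...]."""
    result = [image]
    for _ in range(levels - 1):
        result.append(block_average(result[-1], 2))
    return result


# Toy 4 x 4 "image" made of four homogeneous 2 x 2 blocks.
image = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 2, 2],
    [4, 4, 2, 2],
]
levels = pyramid(image, 3)
for level in levels:
    print(level)
```

Fine structures such as individual buildings are only visible at the first level, whereas coarse structures such as district-scale brightness differences survive at the lower-resolution levels.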

From a high-level perspective, we can say that learning works best with deep
learning approaches exploiting dedicated “network” structures. Here, we understand
networks as design structures of the data flows and the arrangement of pixel
handling steps governing the processing of our images. This concept also supports
more intricate label assignment concepts such as primary labels defining the main
category of an image patch, supplemented by secondary labels that provide additional
information about “mixed classes” or supplementary spatial details of a given
image patch.

In the meantime, some types of networks have emerged that have proven their 
robustness in the case of satellite images to be annotated semantically. In the fol- 
lowing, we list four types of networks that have proven their usefulness for satellite 
image interpretation: 


• Deep Neural Networks (DNNs): as described in [20], these networks consist
of several layers and comprise an input layer, an output layer, and at least one
hidden layer in between. Each layer performs dedicated pixel processing. The
corresponding training phase can be understood as deep learning.


• Recursive Neural Networks (not to be confused with recurrent neural networks;
both network types appear as RNNs): when we have structured input data, these
data can be efficiently handled by recursive neural networks, which are often
used for speech processing and understanding. Recursive neural networks
can also be used for natural scenes such as images containing recursive
structures [23]. RNN algorithms identify the units that an image contains and
how the units interact. Thus, one can use RNNs for semantic scene segmentation
and annotation.


• Convolutional Neural Networks (CNNs): these networks have been conceived
for low-error classification of big images with a very large number of classes.
As described by [21], one can classify more than a million images and assign
more than 1000 different classes. This is accomplished internally by five
convolutional layers, three fully connected layers, and about 60 million internal
parameters. To reduce overfitting, the method applies regularization by randomly
disregarding units during training (“dropout method”).


• Generative Adversarial Networks (GANs): an adversarial network allows the
mutual training of two competing multilayer perceptron models G and D
following an adversarial process: G captures the data distribution, while D
estimates the probability that a sample comes from training data rather than
from G. In addition, D maps the high-dimensional input data to semantic
category labels. For further details, see [24].
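
The adversarial process described in the GAN item above is commonly written as a two-player minimax game. The formulation below is the standard one from the GAN literature and is given here only as an illustration; the symbols V, p_data (distribution of the training data), and p_z (prior distribution of the generator's noise input z) are not defined elsewhere in this chapter.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

Here D(x) is the estimated probability that x comes from the training data, so D is trained to increase both terms, while G is trained to decrease the second term by producing samples G(z) that D mistakes for training data.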


Besides the network types listed above, we also need an overall algorithmic 
architecture embedding the networks. For our applications, a “U” approach has 
proven to be a useful concept for satellite image content analysis. A “U” approach 
contains a descending branch followed by an ascending branch and is conceived for 
handling a progressively shrinking number of elements until a final core element 
(a main category) is found, followed by stepwise complementary semantic infor- 
mation. Further details can be found in [21]. 

In our experience, most general remote sensing applications can be solved
efficiently by CNNs or similar approaches. However, quite a number of innovative
alternatives have been proposed in recent years, for example, common
auto-encoders, recursive approaches for time series, and adversarial networks
for fast learning with only a few examples. In our case, we suggest using CNNs
for non-critical satellite image applications, while highly complicated or
time-critical applications could call for innovative approaches as already described
above.


5. Training and benchmarking


When we train a classification network and verify its performance, the main
goal is to train the system for correct category assignments or semantic annotations
(labels), that is, to add supplementary information to each satellite image
patch that we analyze.

The semantic annotations can either be learned in a preparatory phase or be 
taken from catalogs of already existing categories. If we aim at long-term analyses 
of satellite images, a good approach is to use the same catalogs during the entire 
lifetime of the analysis or to re-run the entire system with updated catalogs. 

The easiest approach is to select typical examples for each category and to assign
the given labels to all new image data. However, if we follow this straightforward
approach, we will probably encounter some difficulties when image patches with
unexpected content arrive. A first remedy is to add an additional “unknown”
category and to assign this label to all image patches that do not fit well into one of the
given categories. Further, experience with machine learning systems has shown that
good classification results can also be reached when we systematically select positive
as well as negative examples (i.e., counterexamples) for each category, leading to
a comprehensive coverage and understanding of each category. This process can
be accomplished manually by knowledgeable operators (i.e., image interpretation
experts) [22]. Another approach is data augmentation: if we do not have sufficient
examples of a necessary category, we can create additional realistic data by simply
flipping or rotating already available images.
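
The flipping and rotating mentioned above can be sketched in a few lines. The sketch represents an image patch as a nested list of pixel values; the function names are hypothetical, and a real pipeline would typically operate on array types instead.

```python
def flip_horizontal(patch):
    """Mirror the patch left to right."""
    return [row[::-1] for row in patch]


def rotate_90(patch):
    """Rotate the patch clockwise by 90 degrees."""
    return [list(row) for row in zip(*patch[::-1])]


def augment(patch):
    """Return the 8 flip/rotation variants (some coincide for symmetric patches)."""
    variants = []
    current = [list(row) for row in patch]
    for _ in range(4):
        variants.append(current)                   # rotated by 0, 90, 180, 270 degrees
        variants.append(flip_horizontal(current))  # mirrored version of that rotation
        current = rotate_90(current)
    return variants


patch = [[1, 2],
         [3, 4]]
variants = augment(patch)
print(len(variants), "variants from one labeled patch")
```

All variants inherit the label of the original patch. Whether they count as realistic depends on the sensor: for SAR patches, for instance, flips and rotations alter the apparent viewing geometry, so such variants should be used with care.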

This simple example leads us to systematic methods for database creation.
One has to find a comprehensive and fairly balanced set of examples that covers
the expected total variety of cases. Thus, we avoid so-called database biases [23].
In addition, one has to make sure that the inclusion of additional examples does
not lead to overfitting or excessive runtimes. This can be accomplished by setting
up a validation testbed where these potential pitfalls can be tested and
where the final performance of the created database structure can be verified.
One has to be aware of the fact that database access times may strongly depend on
the available computer systems, their interconnections, and the selected type of
database.
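
One simple way to keep the category balance of a labeled database when setting up such a testbed is a stratified split into training, validation, and test subsets. The following sketch assumes samples given as (item, label) pairs; the split percentages, the function name, and the toy data are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_split(samples, percentages=(60, 20, 20), seed=0):
    """Split (item, label) pairs into train/validation/test lists per label."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, validation, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = len(group) * percentages[0] // 100
        n_val = len(group) * percentages[1] // 100
        train += group[:n_train]
        validation += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test subset
    return train, validation, test


# Toy database with an unbalanced category distribution.
samples = (
    [(f"urban_{i}", "urban") for i in range(50)]
    + [(f"forest_{i}", "forest") for i in range(30)]
    + [(f"water_{i}", "water") for i in range(20)]
)
train, validation, test = stratified_split(samples)
print(len(train), len(validation), len(test))
```

Because each category is split separately, rare categories appear in all three subsets in the same proportion as in the full database, which helps to detect overfitting on the validation subset before the final benchmarking on the test subset.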

These approaches led to a number of publicly available databases with label
annotations for civilian remote sensing data. There are several semantically
annotated databases based on optical (most often multispectral) data, while there are
only a few databases based on SAR data. Some advanced remote sensing database
examples are [25-27]. Of course, their general applicability and transferability
depend on the actual image resolution, the imaging geometry, and the noise content
of the images. Current state-of-the-art systems are being assessed based on end-to-end
tests that also cover, inter alia, practical aspects such as the runtime depending
on the database design and the selected test images, the amount and organization
of available labels, the correctness of the obtained annotations, and the overall
implementation and validation effort.
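
Two of the aspects listed above, the correctness of the obtained annotations and the runtime, can be measured with a few lines of code. The sketch below uses overall accuracy and per-category precision and recall as correctness measures; these are common choices rather than measures prescribed by this chapter, and the function names, labels, and predictions are invented.

```python
import time
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def overall_accuracy(confusion):
    """Fraction of patches whose predicted label equals the reference label."""
    total = sum(confusion.values())
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    return correct / total


def precision_recall(confusion, label):
    """Precision and recall of one category (0.0 if the category never occurs)."""
    tp = confusion[(label, label)]
    predicted = sum(n for (t, p), n in confusion.items() if p == label)
    actual = sum(n for (t, p), n in confusion.items() if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


def throughput(classify, patches):
    """Run a classifier over all patches and return (predictions, patches per second)."""
    start = time.perf_counter()
    predictions = [classify(p) for p in patches]
    elapsed = time.perf_counter() - start
    rate = len(patches) / elapsed if elapsed > 0 else float("inf")
    return predictions, rate


reference = ["urban", "urban", "forest", "water", "forest", "water"]
predicted = ["urban", "forest", "forest", "water", "forest", "urban"]
confusion = confusion_counts(reference, predicted)
print(f"accuracy = {overall_accuracy(confusion):.2f}")
print("forest precision/recall =", precision_recall(confusion, "forest"))
```

Here `throughput` accepts any callable that maps a patch to a label, so the same harness can time different classifiers on the same benchmark data.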


6. Perspectives 


As for remote sensing images, several semantically annotated collections of
typical high-resolution satellite images already exist: a number of collections
of optical images and a few collections of SAR images. However, these collections
often seem to be potpourris of interesting snapshots rather than systematically
selected samples based on regionally typical target classes and their visibility as a
function of different instrument types. The situation is aggravated by the current
lack of systematically selected benchmarking data that could be used as well-known
reference data for quality and performance assessments such as classification tasks
or throughput testing.

These deficiencies have to be remedied in the near future as more and more
high-resolution images become publicly available, while the end-users already expect
reliable automated image classification and content understanding results for more
and more high-level applications. We can expect that the progress in deep learning
will also lead to much progress in many other fields of image processing, even
beyond the field of remote sensing; thus, the remote sensing community should be
aware of what is published by the image processing and environmental protection
communities at large.


7. Conclusions 


While high-resolution imaging has made much progress for many remote
sensing applications, standardized image classification benchmarking still deserves
more progress. On the one hand, several benchmarking concepts and tools could
still be gleaned from other disciplines; on the other hand, an optimal solution for
SAR image interpretation test cases still requires progress in basic approaches
for verifying actual image classification results and identifying dubious
cases.